Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

add epbf program to trace podman cleanup errors in CI #23487

Merged
merged 2 commits into from
Sep 24, 2024

Conversation

Luap99
Copy link
Member

@Luap99 Luap99 commented Aug 2, 2024

See commits, I just want to check if this even works. Hopefully this allows me to properly understand cleanup errors.

Does this PR introduce a user-facing change?

None

@openshift-ci openshift-ci bot added do-not-merge/work-in-progress Indicates that a PR should not merge because it is a work in progress. release-note-none approved Indicates a PR has been approved by an approver from all required OWNERS files. labels Aug 2, 2024
@Luap99 Luap99 force-pushed the ebpf branch 2 times, most recently from 4cf3a21 to f8a2000 Compare August 2, 2024 16:22
Copy link

Cockpit tests failed for commit f241ed073bd388cd730c44b432c87caec82754c3. @martinpitt, @jelly, @mvollmer please check.

Luap99 added a commit to Luap99/automation_images that referenced this pull request Aug 5, 2024
I like to run a bpftrace based program in CI to collect better logs for
specific processes not observed in the normal testing such as the podman
container cleanup command.

Given you need to have full privs to run ebpf and the package pulls in
an entire toolchain which is almost 500MB in install size we do not add
it the the container images to not bloat them without reason.

containers/podman#23487

Signed-off-by: Paul Holzinger <pholzing@redhat.com>
@Luap99
Copy link
Member Author

Luap99 commented Aug 6, 2024

In case anyone is interested here is an example log:
https://api.cirrus-ci.com/v1/artifact/task/6197821693493248/cleanup_tracer/podman-cleanup-tracer.log

It doesn't look nice mostly because the argv lines is printed with zero bytes not spaces, I don't have a way to get rid of that in my program but I guess I could pipe into sed to replace all zero bytes with spaces to make it easier to read.

@Luap99
Copy link
Member Author

Luap99 commented Aug 6, 2024

@edsantiago please pick this one as well for the no retry PR, maybe I can find the cause in this new log.

@edsantiago
Copy link
Member

edsantiago commented Aug 6, 2024

Here ya go f39 root. [EDIT: direct link to cleanup tracer log]. This is my second run; first one did not have any relevant flakes. This run had only one. I'll rerun at least once more today.

@edsantiago
Copy link
Member

This one has two (debian root). It was the only failure in the third run. I've pushed one more time, but will not be checking its status again tonight.

@Luap99
Copy link
Member Author

Luap99 commented Aug 7, 2024

Ok that doesn't seems to include any new things AFAICS, it does confirm what I have been saying the cleanup is called twice. The only way I see this happening is what the cleanup process failed but in the logs exit codes I clearly see no errors or kill signals send. So there truly must be a condition in our code that is calling cleanup twice even if there was no error the first time...

@Luap99
Copy link
Member Author

Luap99 commented Aug 7, 2024

@edsantiago I assume the no flake retry is the only thing I need to trigger this?

I added the parent pid info to my log as it wasn't clear if netavark teardown was run by the some podman command or two different ones.

@edsantiago
Copy link
Member

Not to trigger it, more precisely, to see it. I think the error is happening all the time, it's just that normally we have to examine each individual log to find it. Yes, setting retries to 0 should make failures visible.

@Luap99
Copy link
Member Author

Luap99 commented Aug 7, 2024

Not to trigger it, more precisely, to see it. I think the error is happening all the time, it's just that normally we have to examine each individual log to find it. Yes, setting retries to 0 should make failures visible.

Yes that is what I meant.

@Luap99
Copy link
Member Author

Luap99 commented Aug 7, 2024

Interesting I found the netns error also in my log for the cleanup process. So it can happen without us ever seeing as wel and without causing a test failure.

stderr 07:36:29.528707 79038    79009    podman      time="2024-08-07T07:36:29-05:00" level=error msg="Cleaning up container: failed to clean up container e22e54194869285aef9d593f0cdfff9e23d5fdfb14615dbcf71aac829d225a77: cannot get namespace path unless container edb052e52aac57efd13979ce3921481aff6baa06e556199775ab0ca6b567cdd8 is running: container is stopped"
stderr 07:36:29.528744 79038    79009    podman      Error: failed to clean up container e22e54194869285aef9d593f0cdfff9e23d5fdfb14615dbcf71aac829d225a77: cannot get namespace path unless container edb052e52aac57efd13979ce3921481aff6baa06e556199775ab0ca6b567cdd8 is running: container is stopped
ainerexec   07:36:29.529410 79114    79061    podman       /usr/libexec/podman/netavark --config /run/containers/networks --rootless=false --aardvark-binary=/usr/libexec/podman/aardvark-dns teardown /run/netns/netns-00a1fc0e-8ec3-0941-ba07-9f4989af1e46
cmd    07:36:29.529780 79114    79061    /usr/libexec/podman/netavark�--config�/run/containers/networks�--rootless=false�--aardvark-binary=/usr/libexec/podman/aardvark-dns�teardown�/run/netns/netns-00a1fc0e-8ec3-0941-ba07-9f4989af1e46�
exit   07:36:29.617636 79114    79061    netavark     0 0
stderr 07:36:29.638209 79038    79009    podman      time="2024-08-07T07:36:29-05:00" level=error msg="IPAM error: failed to get ips for container ID edb052e52aac57efd13979ce3921481aff6baa06e556199775ab0ca6b567cdd8 on network podman-default-kube-network"
ontainer edb052e52aac57efd13979ce3921481aff6baexec   07:36:29.638942 79150    79038    podman       /usr/libexec/podman/netavark --config /run/containers/networks --rootless=false --aardvark-binary=/usr/libexec/podman/aardvark-dns teardown /run/netns/netns-00a1fc0e-8ec3-0941-ba07-9f4989af1e46
cmd    07:36:29.639236 79150    79038    /usr/libexec/podman/netavark�--config�/run/containers/networks�--rootless=false�--aardvark-binary=/usr/libexec/podman/aardvark-dns�teardown�/run/netns/netns-00a1fc0e-8ec3-0941-ba07-9f4989af1e46�
exit   07:36:29.642237 79150    79038    netavark     1 0
stderr 07:36:29.642647 79038    79009    podman      time="2024-08-07T07:36:29-05:00" level=error msg="IPAM error: failed to find ip for subnet 10.89.0.0/24 on network podman-default-kube-network"
775ab0ca6b567cdd8 on network podman-defaulstderr 07:36:29.642714 79038    79009    podman      time="2024-08-07T07:36:29-05:00" level=error msg="netavark: open container netns: open /run/netns/netns-00a1fc0e-8ec3-0941-ba07-9f4989af1e46: IO error: No such file or directory (os error 2)"
-network"
ontainer edb052e52aac57efd13979ce3921481aff6bastderr 07:36:29.642747 79038    79009    podman      time="2024-08-07T07:36:29-05:00" level=error msg="Unable to clean up network for container edb052e52aac57efd13979ce3921481aff6baa06e556199775ab0ca6b567cdd8: \"unmounting network namespace for container edb052e52aac57efd13979ce3921481aff6baa06e556199775ab0ca6b567cdd8: failed to remove ns path: remove /run/netns/netns-00a1fc0e-8ec3-0941-ba07-9f4989af1e46: no such file or directory, failed to unmount NS: at /run/netns/netns-00a1fc0e-8ec3-0941-ba07-9f4989af1e46: no such file or directory\""
exit   07:36:29.656424 79103    78908    podman       0 0
exit   07:36:29.676194 79038    79009    podman       125 0

https://api.cirrus-ci.com/v1/artifact/task/5220866676490240/cleanup_tracer/podman-cleanup-tracer.log
https://api.cirrus-ci.com/v1/artifact/task/5220866676490240/html/int-podman-rawhide-root-host-sqlite.log.html

@Luap99 Luap99 added the No New Tests Allow PR to proceed without adding regression tests label Aug 7, 2024
@Luap99 Luap99 force-pushed the ebpf branch 6 times, most recently from 4f5828b to ed8dc4d Compare August 8, 2024 14:43
@openshift-merge-robot openshift-merge-robot removed the needs-rebase Indicates a PR cannot be merged because it has merge conflicts with HEAD. label Aug 9, 2024
@Luap99
Copy link
Member Author

Luap99 commented Aug 9, 2024

@edsantiago Great news I am confident that I found the root cause, a locking/container state misuse bug that caused us to add the netns back after we cleaned it up already.
Please give d6bcc14 a try. Hopefully this is enough, I will prepare a proper PR with a better description on Monday.

@edsantiago
Copy link
Member

One new failure seen, f39 root:

# podman [options] network ls --quiet
net5ef1e5e5df3c173e8948e8e487fc7c21c14438e290de0a5a04e3f8b610f60b47
networkIDTest
podman
podman-default-kube-network
time="2024-08-10T17:56:31Z" level=warning msg="Error reading network config file \"/etc/containers/networks/net25120a9e54ed259eaa0b8b853f1e8cda89d0c538a55825de0e591c2e67950554.json\": EOF"

[FAILED] Unexpected warnings seen on stderr: "time=\"2024-08-10T17:56:31Z\" level=warning msg=\"Error reading network config file \\\"/etc/containers/networks/net25120a9e54ed259eaa0b8b853f1e8cda89d0c538a55825de0e591c2e67950554.json\\\": EOF\""

Looks more like a race in network ls than anything you've changed, but still, it's a new one I don't remember seeing before.

@edsantiago
Copy link
Member

It took a LOT of runs, but here's what looks like the same issue. rawhide rootless:

$ podman [options] stop --all -t 0
Error: removing container 78e1e7c4d390f06129b5a13de8ede2351eafaf3ccc76f665177b7e5ce3bc922c network: unmounting network namespace for container 78e1e7c4d390f06129b5a13de8ede2351eafaf3ccc76f665177b7e5ce3bc922c: failed to remove ns path: remove /run/user/2153/netns/netns-1fac2ba0-0d2a-10cb-7dbe-900b0f29ae60: device or resource busy

@Luap99
Copy link
Member Author

Luap99 commented Aug 11, 2024

It took a LOT of runs, but here's what looks like the same issue. rawhide rootless:

$ podman [options] stop --all -t 0
Error: removing container 78e1e7c4d390f06129b5a13de8ede2351eafaf3ccc76f665177b7e5ce3bc922c network: unmounting network namespace for container 78e1e7c4d390f06129b5a13de8ede2351eafaf3ccc76f665177b7e5ce3bc922c: failed to remove ns path: remove /run/user/2153/netns/netns-1fac2ba0-0d2a-10cb-7dbe-900b0f29ae60: device or resource busy

Note the same issue (EBUSY vs ENOENT), this should be fixed by containers/common#2112, vendor in #23519

@Luap99
Copy link
Member Author

Luap99 commented Aug 11, 2024

Also d6bcc14 is such a generic issue it would not surprise me if this fixes other weird stop flakes as well.

@openshift-merge-robot openshift-merge-robot added the needs-rebase Indicates a PR cannot be merged because it has merge conflicts with HEAD. label Aug 12, 2024
@openshift-merge-robot openshift-merge-robot removed the needs-rebase Indicates a PR cannot be merged because it has merge conflicts with HEAD. label Sep 3, 2024
@Luap99 Luap99 changed the title WIP: add epbf program to trace podman cleanup errors in CI add epbf program to trace podman cleanup errors in CI Sep 3, 2024
@Luap99 Luap99 marked this pull request as ready for review September 3, 2024 09:25
@openshift-ci openshift-ci bot removed the do-not-merge/work-in-progress Indicates that a PR should not merge because it is a work in progress. label Sep 3, 2024
Add a new program based on bpftrace[1] to trace all podman processes
with arguments and exit code/signals. Additionally this captures stderr
from all podman container cleanup processes spawned by conmon which
otherwise go to /dev/null and are never seen in any CI logs.
Hopefull this allows us to debug strange network cleanup error seen in
CI, my plan is to add this to the cirrus setup and upload the logs so we
can check them when the flakes happen.

[1] /~https://github.com/bpftrace/bpftrace

Signed-off-by: Paul Holzinger <pholzing@redhat.com>
In order to get better debug data for cleanup flakes. The argv is
printed with 0 bytes so replace them with spaces to make the log
readable for humans.

Signed-off-by: Paul Holzinger <pholzing@redhat.com>
@Luap99
Copy link
Member Author

Luap99 commented Sep 24, 2024

@edsantiago @mheon PTAL, I would like to get this merged as it just helped me today again for #24044 so I think this is useful in general.

Copy link
Member

@edsantiago edsantiago left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Qualified LGTM (I don't grok ebpf, so I've only re-reviewed the runner changes)

@@ -57,3 +57,4 @@ contrib/win-installer/shasums
contrib/win-installer/*.wixobj
contrib/win-installer/*.wixpdb
contrib/win-installer/*.log
podman-cleanup-tracer.log*
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why the asterisk? (nonblocking)

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Oh that was a leftover from a previous iteration where I wrote to a *.tmp file and then called tr later but then I figured I can just pipe into tr and do not need a extra file so we can drop it in a future cleanup I guess

Copy link
Contributor

openshift-ci bot commented Sep 24, 2024

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: edsantiago, Luap99

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@mheon
Copy link
Member

mheon commented Sep 24, 2024

/lgtm

@openshift-ci openshift-ci bot added the lgtm Indicates that a PR is ready to be merged. label Sep 24, 2024
@openshift-merge-bot openshift-merge-bot bot merged commit af4b061 into containers:main Sep 24, 2024
88 checks passed
@Luap99 Luap99 deleted the ebpf branch September 24, 2024 15:49
@stale-locking-app stale-locking-app bot added the locked - please file new issue/PR Assist humans wanting to comment on an old issue or PR with locked comments. label Dec 24, 2024
@stale-locking-app stale-locking-app bot locked as resolved and limited conversation to collaborators Dec 24, 2024
Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.
Labels
approved Indicates a PR has been approved by an approver from all required OWNERS files. lgtm Indicates that a PR is ready to be merged. locked - please file new issue/PR Assist humans wanting to comment on an old issue or PR with locked comments. No New Tests Allow PR to proceed without adding regression tests release-note-none
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants